Sentence boundary detection using sequential dependency analysis combined with CRF-based chunking

Authors

  • Takanobu Oba
  • Takaaki Hori
  • Atsushi Nakamura
Abstract

In spoken language, sentence boundaries are much less explicit than in written language. Since conventional natural language processing (NLP) techniques are generally designed assuming the sentence boundaries are already given, it is crucial to detect the boundaries accurately for applying such NLP techniques to spoken language. Classification frameworks, such as Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), can be used to detect the boundaries. With these methods, the sentence boundaries are determined based on local sentence-end-like word sequences around the boundaries. However, the methods do not evaluate whether or not each block determined by the boundaries is appropriate as a sentence. We have proposed sequential dependency analysis (SDA), which extracts the dependency structure of unsegmented word sequences with a subsidiary mechanism of sentence boundary detection. In this paper, we extend SDA by combining it with CRFs to reflect both the properties of local word sequences and the appropriateness as a sentence. In this way we achieve more accurate sentence boundary detection. The experimental result shows that our proposed method provides better detection accuracy than that obtained with SVMs or CRFs alone. Our method can also work sequentially because it is based on the SDA framework and can be used for on-line spoken applications.
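The idea of weighing local boundary evidence against the appropriateness of each resulting block as a sentence can be illustrated with a small sketch. This is not the authors' SDA-plus-CRF method: it uses an offline dynamic program rather than a sequential analysis, and every name in it is hypothetical. `boundary_prob` stands in for per-position boundary probabilities such as a CRF might produce, and `segment_score` stands in for any measure of how sentence-like a block is (in the paper that role is played by dependency analysis; here it is a toy length penalty).

```python
import math
from typing import Callable, List, Sequence


def best_segmentation(
    words: Sequence[str],
    boundary_prob: Sequence[float],
    segment_score: Callable[[Sequence[str]], float],
    weight: float = 1.0,
) -> List[int]:
    """Return the exclusive end index of each sentence in the best split.

    boundary_prob[i] is an assumed local probability that a sentence
    boundary follows words[i]. segment_score(block) is an assumed
    log-domain score of how sentence-like a block of words is.
    """
    n = len(words)
    if n == 0:
        return []
    eps = 1e-12  # avoid log(0)
    log_b = [math.log(max(p, eps)) for p in boundary_prob]
    log_nb = [math.log(max(1.0 - p, eps)) for p in boundary_prob]

    # prefix[k] = sum of "no boundary" log-probabilities for positions < k
    prefix = [0.0]
    for value in log_nb:
        prefix.append(prefix[-1] + value)

    best = [-math.inf] * (n + 1)  # best[k]: best score for words[:k]
    back = [0] * (n + 1)          # back[k]: start of the last block
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            # No boundary inside the block, one boundary at its end.
            local = (prefix[end - 1] - prefix[start]) + log_b[end - 1]
            block = weight * segment_score(words[start:end])
            total = best[start] + local + block
            if total > best[end]:
                best[end] = total
                back[end] = start

    ends: List[int] = []
    pos = n
    while pos > 0:
        ends.append(pos)
        pos = back[pos]
    return ends[::-1]


def local_only(block: Sequence[str]) -> float:
    """Ignore block quality, so only local evidence decides."""
    return 0.0


def penalize_short(block: Sequence[str]) -> float:
    """Toy appropriateness score: blocks under three words are unlikely."""
    return -10.0 if len(block) < 3 else 0.0


words = ["i", "went", "home", "it", "was", "late"]
probs = [0.1, 0.6, 0.55, 0.1, 0.1, 0.9]
local_result = best_segmentation(words, probs, local_only)
combined_result = best_segmentation(words, probs, penalize_short)
```

With local evidence alone, the toy input is split after every position whose probability exceeds 0.5, which yields very short blocks. Adding the block-level score removes the split that would produce a two-word "sentence", which is the kind of correction the abstract attributes to evaluating each block as a whole.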

Similar resources

A deep neural network approach for sentence boundary detection in broadcast news

This paper presents a deep neural network (DNN) approach to sentence boundary detection in broadcast news. We extract prosodic and lexical features at each inter-word position in the transcripts and learn a sequential classifier to label these positions as either boundary or non-boundary. This work is realized by a hybrid DNN-CRF (conditional random field) architecture. The DNN accepts prosodic...

Dependency structure analysis and sentence boundary detection in spontaneous Japanese

This paper addresses automatic detection of dependencies between Japanese phrasal units called bunsetsus, and sentence boundaries in a spontaneous speech corpus. In spontaneous speech, the biggest problem with dependency structure analysis is that sentence boundaries are ambiguous. In this paper, we propose two methods for improving the accuracy of sentence boundary detection in spontaneous Jap...

Japanese Dependency Analysis using Cascaded Chunking

In this paper, we propose a new statistical Japanese dependency parser using a cascaded chunking model. Conventional Japanese statistical dependency parsers are mainly based on a probabilistic model, which is not always efficient or scalable. We propose a new method that is simple and efficient, since it parses a sentence deterministically only deciding whether the current segment modifies the ...

Extraction of Drug-Drug Interaction from Literature through Detecting Linguistic-based Negation and Clause Dependency

Extracting biomedical relations such as drug-drug interaction (DDI) from text is an important task in biomedical NLP. Due to the large number of complex sentences in biomedical literature, researchers have employed some sentence simplification techniques to improve the performance of the relation extraction methods. However, due to difficulty of the task, there is no noteworthy improvement in t...

Three Types of Chunking in Korean and Dependency Analysis Based on Lexical Association

The curtailment of disambiguation decisions is crucial for efficient and precise analysis of sentences in the view of parsing as making a sequence of disambiguation. In this paper we propose three types of chunking in Korean for purpose of the reduction of search space. We present the parsing method based on chunking and the association among chunks and words in a chunk. Test was conducted on 237...

Publication date: 2006